
A Task-Level Case Study

Neural Information Processing Systems

This section illustrates how a model's performance may vary across different tasks associated with a new term. We analyzed the performance of Llama-3-Instruct-70B on the new term "wokely."

Task 1 (COMA): The book's cover was described as wokely by several reviewers, so
A. it struggled to attract attention on the bookstore displays despite a …
B. many readers were enticed to buy it, strengthening its presence on …
C. readers were intrigued and the book's sales experienced an unexpected surge worldwide.
D. the publisher decided to release a limited edition with a special …

Task 2: In the previous sentence, does _ refer to …?

Task 3: Is this example in line with commonsense and grammatically correct?

As observed, the model answered correctly only on the COMA task and failed on the other two. On the COMA task, it successfully inferred that "wokely" carries a negative connotation, although the phrase "hard to find a satisfying …" … These results provide a comprehensive evaluation of the model's understanding of the term "wokely."
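The case study above can be sketched as a small per-task, multiple-choice evaluation loop. This is an illustrative sketch only: the task names beyond COMA and the scoring scheme are assumptions, not the benchmark's actual implementation.

```python
# Hypothetical sketch of task-level evaluation for one new term.
# Task names ("COST", "CSJ") and answers are illustrative assumptions.

def evaluate_term(term, tasks, model_answer):
    """Score a model's answer on each task for a single new term."""
    results = {}
    for task_name, (question, gold) in tasks.items():
        prediction = model_answer(task_name, question)
        results[task_name] = (prediction == gold)
    return results

# Toy example: the model gets COMA right but misses the other two tasks,
# mirroring the behaviour described for "wokely" above.
tasks = {
    "COMA": ("Choose the most plausible continuation ...", "A"),
    "COST": ("What does the blank refer to ...", "B"),
    "CSJ":  ("Is this sentence commonsensical? ...", "yes"),
}
fixed_answers = {"COMA": "A", "COST": "C", "CSJ": "no"}
report = evaluate_term("wokely", tasks, lambda t, q: fixed_answers[t])
print(report)  # {'COMA': True, 'COST': False, 'CSJ': False}
```

Per-task booleans like these make it easy to see where understanding of a term breaks down, rather than collapsing everything into one aggregate score.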


NewTerm: Benchmarking Real-Time New Terms for Large Language Models with Annual Updates

Neural Information Processing Systems

However, existing benchmarks focus on outdated content and limited fields, facing difficulties in real-time updating and leaving new terms unexplored. To address this problem, we propose an adaptive benchmark, NewTerm, for real-time evaluation of new terms. We design a highly automated construction method to ensure high-quality benchmark construction with minimal human effort, allowing flexible updates for real-time information. Empirical results on various LLMs demonstrate over 20% performance reduction caused by new terms. Additionally, while updates to the knowledge cutoff of LLMs can cover some of the new terms, they are unable to generalize to more distant new terms. We also analyze which types of terms are more challenging and why LLMs struggle with new terms, paving the way for future research. Finally, we construct NewTerm 2022 and 2023 to evaluate the new terms updated each year and will continue updating annually. The benchmark and codes can be found at https://anonymous.4open.science/r/NewTerms.
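The "over 20% performance reduction" claim implies a comparison between accuracy on new-term questions and a familiar-term baseline. A minimal sketch of that metric, with purely illustrative numbers (the paper's exact formula is not reproduced here):

```python
# Relative performance drop caused by new terms; the 0.80 / 0.62
# accuracies below are made-up illustrative values, not benchmark results.

def relative_reduction(baseline_acc, newterm_acc):
    """Relative drop in accuracy, as a fraction of the baseline."""
    return (baseline_acc - newterm_acc) / baseline_acc

drop = relative_reduction(baseline_acc=0.80, newterm_acc=0.62)
print(f"{drop:.1%}")  # 22.5%
```

Reporting the drop relative to the baseline (rather than as an absolute difference) keeps the metric comparable across models with different overall accuracy.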




WeTransfer says user content will not be used to train AI after backlash

The Guardian

The popular filesharing service WeTransfer has said user content will not be used to train artificial intelligence after a change in its service terms had triggered a public backlash. The company, which is regularly used by creative professionals to transfer their work online, had suggested in new terms that uploaded files could be used to "improve machine learning models". The clause had previously said the service had a right to "reproduce, modify, distribute and publicly display" content, and the updated version caused confusion among users. A WeTransfer spokesperson said user content had never been used, even internally, to test or develop AI models and that "no specific kind of AI" was being considered for use by the Dutch company. The firm said: "There's no change in how WeTransfer handles your content in practice."



NewTerm: Benchmarking Real-Time New Terms for Large Language Models with Annual Updates

Deng, Hexuan, Jiao, Wenxiang, Liu, Xuebo, Zhang, Min, Tu, Zhaopeng

arXiv.org Artificial Intelligence

However, existing benchmarks focus on outdated content and limited fields, facing difficulties in real-time updating and leaving new terms unexplored. To address this problem, we propose an adaptive benchmark, NewTerm, for real-time evaluation of new terms. We design a highly automated construction method to ensure high-quality benchmark construction with minimal human effort, allowing flexible updates for real-time information. Empirical results on various LLMs demonstrate over 20% performance reduction caused by new terms. Additionally, while updates to the knowledge cutoff of LLMs can cover some of the new terms, they are unable to generalize to more distant new terms. We also analyze which types of terms are more challenging and why LLMs struggle with new terms, paving the way for future research. Finally, we construct NewTerm 2022 and 2023 to evaluate the new terms updated each year and will continue updating annually.


Single Ground Truth Is Not Enough: Add Linguistic Variability to Aspect-based Sentiment Analysis Evaluation

Yang, Soyoung, Cho, Hojun, Lee, Jiyoung, Yoon, Sohee, Choi, Edward, Choo, Jaegul, Cho, Won Ik

arXiv.org Artificial Intelligence

Aspect-based sentiment analysis (ABSA) is the challenging task of extracting sentiment along with its corresponding aspects and opinions from human language. Due to the inherent variability of natural language, aspect and opinion terms can be expressed in various surface forms, making their accurate identification complex. Current evaluation methods for this task often restrict answers to a single ground truth, penalizing semantically equivalent predictions that differ in surface form. To address this limitation, we propose a novel, fully automated pipeline that augments existing test sets with alternative valid responses for aspect and opinion terms. This approach enables a fairer assessment of language models by accommodating linguistic diversity, resulting in higher human agreement than single-answer test sets (up to 10%p improvement in Kendall's Tau score). Our experimental results demonstrate that Large Language Models (LLMs) show substantial performance improvements over T5 models when evaluated using our augmented test set, suggesting that LLMs' capabilities in ABSA tasks may have been underestimated. This work contributes to a more comprehensive evaluation framework for ABSA, potentially leading to more accurate assessments of model performance in information extraction tasks, particularly those involving span extraction.
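The core idea of the augmented test set — accepting any of several valid surface forms rather than a single gold answer — can be sketched as follows. This is a simplified illustration under assumed matching rules (case-insensitive exact match); the paper's actual pipeline is not reproduced here.

```python
# Hedged sketch of multi-reference span evaluation for ABSA:
# a prediction is correct if it matches ANY valid surface form.

def span_match(prediction, valid_answers):
    """Case-insensitive match against a set of alternative gold spans."""
    normalized = prediction.strip().lower()
    return normalized in {a.strip().lower() for a in valid_answers}

# "battery life" and "battery" are treated as equivalent aspect spans.
gold = {"aspect": ["battery life", "battery"],
        "opinion": ["excellent", "great"]}
print(span_match("Battery", gold["aspect"]))  # True  (accepted alternative)
print(span_match("screen", gold["aspect"]))   # False (still wrong)
```

Under a single-ground-truth scheme, "battery" would be penalized even though it identifies the same aspect; the multi-reference check avoids that penalty without accepting genuinely wrong spans.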


Dictionary.com's Largest Update (Re)defines Thousands Of Words, Focusing On Identity

NPR Technology

Dictionary.com has updated thousands of entries to reflect the changing use of language in 2020, particularly in subjects like race, gender, health, technology and politics. Anyone grasping for the right word or phrase to describe life in 2020 now has a larger lexicon to work with. Dictionary.com has updated thousands of entries and added hundreds of words in its largest release to date, a reflection of the ways in which society and language have evolved even in just the past few months. The digital dictionary announced earlier this week that it updated more than 15,000 entries and added 650 brand new terms.